Non-record: ASQU activation, Mixture of Convolutions, BankedLinear #679

Open
andrewmouldon wants to merge 7 commits into openai:main from andrewmouldon:main

Conversation


@andrewmouldon commented on Mar 25, 2026

Explores architectural changes aimed at improving capacity per parameter, even if slower than typical leaderboard approaches:

  • ASQU (Asymmetric Squared Unit): learned per-channel generalization of ReLU^2
  • MoC (Mixture of Convolutions): token-conditioned mixture of short convolutions (dynamic per-token kernels)
  • BankedLinear: shared weight bank across layers (small learned + large fixed random set) with learned per-layer mixing

Ablations use the base train_gpt.py script for a fixed 10k steps; the MLP expansion factor is adjusted to keep total model size matched.

  • ASQU replaces ReLU^2
  • Short conv / MoC is applied to QKV
  • BankedLinear replaces QKV projections

Results

| Model Variant | Pre-quant BPB | Post-quant BPB | Size (bytes) | MLP Mult |
|---|---|---|---|---|
| Baseline | 1.2262 | 1.2328 | 15861272 | 2.00 |
| + ASQU | 1.2232 | 1.2301 | 15898146 | 2.00 |
| + Short Conv (k=1) | 1.2157 | 1.2217 | 15973462 | 1.99 |
| + MoC (k=8) | 1.2121 | 1.2182 | 15911167 | 1.93 |
| + BankedLinear | 1.2098 | 1.2164 | 15852659 | 2.60 |

Results are currently single-seed (1337); additional runs in progress.

ASQU: the per-channel parameterization gives a ~0.001 BPB improvement over a single scalar; the scalar form converges to behavior similar to a leaky ReLU^2 with slope 0.5.
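The PR does not spell out the exact ASQU formula, so the following is a minimal NumPy sketch of one plausible parameterization consistent with the description: ReLU^2 on the positive side plus a learned per-channel coefficient `alpha` on the squared negative side. At `alpha = 0` it reduces to plain ReLU^2, and a single scalar `alpha` gives the leaky-ReLU^2-like behavior mentioned above. The function name and signature are hypothetical.

```python
import numpy as np

def asqu(x, alpha):
    """ASQU sketch (assumed form): asymmetric squared unit.
    x:     (..., C) activations
    alpha: per-channel coefficient, shape (C,) (or a scalar)
    Returns relu(x)^2 + alpha * relu(-x)^2, so alpha controls how much
    of the squared negative side leaks through, per channel."""
    pos = np.maximum(x, 0.0) ** 2
    neg = np.minimum(x, 0.0) ** 2
    return pos + alpha * neg  # alpha broadcasts over the channel dim
```

With `alpha` initialized at zero this starts out exactly as ReLU^2, which makes it a drop-in replacement in the MLP.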

Learning the exponent instead of fixing the square was also explored. While it did not consistently improve final performance (and was more expensive), it revealed a consistent depth-dependent pattern in the learned exponents:

  • early layers ~1.4
  • middle layers ~1.8
  • late layers ~2.2
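The learned-exponent variant can be sketched the same way (again an assumed form, reusing the hypothetical ASQU parameterization above with the square replaced by a trainable power `p`):

```python
import numpy as np

def asqu_learned_exponent(x, alpha, p):
    """Variant sketch (assumed form): learn the exponent p instead of
    fixing the square. Per the reported pattern, p would settle near
    ~1.4 in early layers and ~2.2 in late layers.
    abs() keeps the negative branch real-valued for non-integer p."""
    pos = np.maximum(x, 0.0) ** p
    neg = np.abs(np.minimum(x, 0.0)) ** p
    return pos + alpha * neg
```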

With MoC, generating the dynamic kernel directly via a learned projection performed poorly, suggesting a more constrained mechanism (e.g. interpolation over a basis of kernels) is necessary for stable optimization.
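A minimal NumPy sketch of the constrained basis-interpolation mechanism, assuming a per-token softmax over a bank of K short causal kernels shared across channels (the function names, the gating projection `w_gate`, and the exact gating form are assumptions, not the PR's code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moc(x, kernel_bank, w_gate):
    """Mixture-of-Convolutions sketch (assumed mechanism):
    each token computes softmax weights over a bank of K short causal
    kernels, then is convolved with its own interpolated kernel.
    x:           (T, C) token activations
    kernel_bank: (K, W) bank of width-W kernels, shared across channels
    w_gate:      (C, K) projection producing per-token mixing logits"""
    T, C = x.shape
    K, W = kernel_bank.shape
    gates = softmax(x @ w_gate, axis=-1)   # (T, K) per-token mixture weights
    kernels = gates @ kernel_bank          # (T, W) dynamic kernel per token
    x_pad = np.concatenate([np.zeros((W - 1, C)), x], axis=0)  # causal padding
    out = np.zeros_like(x)
    for t in range(T):
        window = x_pad[t : t + W]          # (W, C): positions t-W+1 .. t
        out[t] = kernels[t] @ window       # mix over the W time taps
    return out
```

Because each token only interpolates within the bank, the dynamic kernel stays inside a fixed low-dimensional subspace, which is the constraint hypothesized to stabilize optimization relative to projecting the kernel weights directly.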

For BankedLinear, one scalar weight per bank entry per layer is used to construct the mixture. Experiments with per-head weights worsened performance.
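A NumPy sketch of the structure described above, assuming each layer's effective weight is a scalar-weighted sum over a shared bank of matrices (a few learned entries plus a larger fixed random set); the class name matches the PR, but the field names, sizes, and initialization scales are illustrative assumptions:

```python
import numpy as np

class BankedLinear:
    """BankedLinear sketch (assumed structure): a bank of weight
    matrices shared across all layers, split into a small learned set
    and a larger fixed random set. Each layer owns one scalar mixing
    coefficient per bank entry; its effective weight is the
    coefficient-weighted sum of the whole bank."""

    def __init__(self, d_in, d_out, n_learned=2, n_fixed=6, n_layers=4, seed=0):
        rng = np.random.default_rng(seed)
        self.learned = rng.standard_normal((n_learned, d_out, d_in)) * 0.02  # trainable
        self.fixed = rng.standard_normal((n_fixed, d_out, d_in)) * 0.02      # frozen
        # one scalar per (layer, bank entry) -- the only per-layer parameters
        self.mix = rng.standard_normal((n_layers, n_learned + n_fixed)) * 0.1

    def weight(self, layer):
        bank = np.concatenate([self.learned, self.fixed], axis=0)  # (B, d_out, d_in)
        return np.einsum('b,bij->ij', self.mix[layer], bank)       # (d_out, d_in)

    def forward(self, x, layer):
        return x @ self.weight(layer).T
```

The parameter savings come from the per-layer cost being just B scalars instead of a full d_out × d_in matrix, at the price of every layer's projection living in the span of the shared bank.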

A layer-specific LoRA on top of the shared bank was also tried, but this worsened performance compared to simply investing the capacity in the MLP.
